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Abstract — We present several new results pertaining to 
haplotyping. These results concern the combinatorial prob- 
lem of reconstructing haplotypes from incomplete and/or 
imperfectly sequenced haplotype fragments. We consider 
the complexity of the problems Minimum Error Correction 
(MEC) and Longest Haplotype Reconstruction (LHR) for 
different restrictions on the input data. Specifically, we 
look at the gapless case, where every row of the input 
corresponds to a gapless haplotype-fragment, and the 1- 
gap case, where at most one gap per fragment is allowed. 
We prove that MEC is APX-hard in the 1-gap case and 
still NP-hard in the gapless case. In addition, we question 
earlier claims that MEC is NP-hard even when the input 
matrix is restricted to being completely binary. Concerning 
LHR, we show that this problem is NP-hard and APX-hard 
in the 1-gap case (and thus also in the general case), but 
is polynomial time solvable in the gapless case. 

Index Terms — Combinatorial algorithms, Biology and 
genetics, Complexity hierarchies 



I. Introduction 

If we abstractly consider the human genome as 
a string over the nucleotide alphabet {A, C, G, T}, 
it is widely known that the genomes of any two 
humans have at more than 99% of the sites the 
same nucleotide. The sites at which variability is 
observed across the human population are called 
Single Nucleotide Polymorphisms (SNPs), which 
are formally defined as the sites on the human 
genome where, across the human population, 
two or more nucleotides are observed and each 
such nucleotide occurs in at least 5% of the 
population. These sites, which occur (on average) 
approximately once per thousand bases, capture 
the bulk of human genetic variability; the string of 
nucleotides found at the SNP sites of a human - the 
haplotype of that individual - can thus be thought 
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of as a "fingerprint" for that individual. 

It has been observed that, for most SNP sites, only 
two nucleotides are seen; sites where three or four 
nucleotides are found are comparatively rare. Thus, 
from a combinatorial perspective, a haplotype can 
be abstractly expressed as a string over the alphabet 
{0, 1}. Indeed, the biologically-motivated field of 
SNP and haplotype analysis has spawned a rich 
variety of combinatorial problems, which are well 
described in surveys such as [6| and [9|. 

We focus on two such combinatorial problems, 
both variants of the Single Individual Haplotyping 
Problem (SIH), introduced in lfT51 . SIH amounts 
to determining the haplotype of an individual 
using (potentially) incomplete and/or imperfect 
fragments of sequencing data. The situation is 
further complicated by the fact that, being a 
diploid organism, a human has two versions of 
each chromosome; one each from the individual's 
mother and father. Hence, for a given interval 
of the genome, a human has two haplotypes. 
Thus, SIH can be more accurately described as 
finding the two haplotypes of an individual given 
fragments of sequencing data where the fragments 
potentially have read errors and, crucially, where 
it is not known which of the two chromosomes 
each fragment was read from. We consider two 
well-known variants of the problem: Minimum 
Error Correction (MEC), and Longest Haplotype 
Reconstruction (LHR). 

The input to these problems is a matrix M 
of SNP fragments. Each column of M represents 
an SNP site and thus each entry of the matrix 
denotes the (binary) choice of nucleotide seen 
at that SNP location on that fragment. An entry 
of the matrix can thus either be '0', '1' or a 
hole, represented by which denotes lack of 
knowledge or uncertainty about the nucleotide 
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at that site. We use M[i,j] to refer to the value 
found at row i, column j of M, and use M[i) 
to refer to the ith row. Two rows r 1 ,r 2 of the 
matrix conflict if there exists a column j such that 

M[ ri ,j]^M[r 2 ,j] andM[n,j),M[r 2 ,j] G {0,1}. 

A matrix is feasible iff the rows of the matrix 
can be partitioned into two sets such that all rows 
within each set are pairwise non-conflicting. 

The objective in MEC is to "correct" (or "flip") 
as few entries of the input matrix as possible (i.e. 
convert to 1 or vice-versa) to arrive at a feasible 
matrix. The motivation behind this is that all rows 
of the input matrix were sequenced from one 
haplotype or the other, and that any deviation from 
that haplotype occurred because of read-errors 
during sequencing. 

The problem LHR has the same input as MEC 
but a different objective. Recall that the rows of 
a feasible matrix M can be partitioned into two 
sets such that all rows within each set are pairwise 
non-conflicting. Having obtained such a partition, 
we can reconstruct a haplotype from each set by 
merging all the rows in that set together. (We 
define this formally later in Section [III]) With 
LHR the objective is to remove rows such that the 
resulting matrix is feasible and such that the sum 
of the lengths of the two resulting haplotypes is 
maximised. 

In the context of haplo typing, MEC and LHR 
have been discussed - sometimes under different 
names - in papers such as @, fT^I. [8| and 
(implicitly) fT~5l . One question arising from this 
discussion is how the distribution of holes in the 
input data affects computational complexity. To 
explain, let us first define a gap (in a string over the 
alphabet {0, 1, — }) as a maximal contiguous block 
of holes that is flanked on both sides by non-hole 

values. For example, the string 0010 

has no gaps, -0 — 10-111 has two gaps, and 

-0 1-- has one gap. Two special cases of 

MEC and LHR that are considered to be practically 
relevant are the ungapped case and the 1-gap case. 
The ungapped variant is where every row of the 
input matrix is ungapped, i.e. all holes appear at 
the start or end. In the 1-gap case every row has at 
most one gap. 



In Section Ill-AI we offer what we believe is the 
first proof that Ungapped-MEC (and hence 1-gap 
MEC and also the general MEC) is NP-hard. We 
do so by reduction from MAX-CUT. (As far as 
we are aware, other claims of this result are based 
explicitly or implicitly on results found in 1121 : as 
we discuss in Section III-CI we conclude that the 
results in lfT2ll cannot be used for this purpose.) 

The NP-hardness of 1-gap MEC (and general 
MEC) follows immediately from the proof 
that Ungapped-MEC is NP-hard. However, our 
NP-hardness proof for Ungapped-MEC is not 
approximation-preserving, and consequently tells 
us little about the (in)approximability of Ungapped- 
MEC, 1-gap MEC and general MEC. In light of 
this we provide (in Section ITl-BI) a proof that 1-gap 
MEC is APX-hard, thus excluding (unless P=NP) 
the existence of a Polynomial Time Approximation 
Scheme (PTAS) for 1-gap MEC (and general MEC.) 



We define (in Section III-CI) the problem Binary- 
MEC, where the input matrix contains no holes; 
as far as we know the complexity of this problem 
is still - intriguingly - open. We also consider a 
parameterised version of binary-MEC, where the 
number of haplotypes is not fixed as two, but is 
part of the input. We prove that this problem is 
NP-hard in Section Ill-DI (In the Appendix we 
also prove an "auxiliary" lemma which, besides 
being interesting in its own right, takes on a new 
significance in light of the open complexity of 
Binary-MEC.) 

In Section IIII-AI we show that Ungapped-LHR 
is polynomial-time solvable and give a dynamic 
programming algorithm for this which runs in time 
0(n 2 m + n 3 ) for an n x m input matrix. This 
improves upon the result of lfT51l which also showed 
a polynomial-time algorithm for Ungapped-LHR 
but under the restricting assumption of non-nested 
input rows. 

We also prove, in Section IIII-BI that LHR is 
APX-hard (and thus also NP-hard) in the general 
case, by proving the much stronger result that 
1-gap LHR is APX-hard. This is the first proof of 
hardness (for both 1-gap LHR and general LHR) 
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appearing in the literature. 1 

II. Minimum Error Correction (MEC) 

For a length-m string X E {0, 1, — } m , and a 
length-m string Y E {0, l} m , we define d(X,Y) as 
the number of mismatches between the strings i.e. 
positions where X is and Y is 1, or vice- versa; 
holes do not contribute to the mismatch count. 
Recall the definition of feasible from earlier; an 
alternative, and equivalent, definition (which we 
use in the following proofs) is as follows. An 
nx m SNP matrix M is feasible iff there exist two 
strings (haplotypes) Hi,H 2 E {0, l} m , such that 
for all rows r of M, d(r, Hi) = or d(r, H 2 ) =0. 

Finally, a flip is where a entry is converted 
to a 1, or vice-versa. Flipping to or from holes is 
not allowed and the haplotypes Hi and H 2 may 
not contain holes. 



A. Ungapped-MEC 

Problem: Ungapped-MEC 

Input: An ungapped SNP matrix M 

Output: Ungapped-MEC(M), which we define as 

the smallest number of flips needed to make M 

feasible. 2 

Lemma 1: Ungapped-MEC is NP-hard. 

Proof: We give a polynomial-time reduction 
from MAX-CUT, which is the problem of com- 
puting the size of a maximum cardinality cut in a 
graph. 3 Let G = (V, E) be the input to MAX-CUT, 
where E is undirected. (We identify, wlog, V with 
{1,2,...,|V|}.) We construct an input matrix M for 
Ungapped-MEC with 2k\V\ + \E\ rows and 2\V\ 
columns where k = 2\E\\V\. We use M to refer to 
the first k\V\ rows of M, Mi to refer to the second 
k\V\ rows of M, and Mq to refer to the remaining 
\E\ rows. M consists of |V| consecutive blocks of 
k identical rows. Each row in the i-th block (for 
1 < i < \V\) contains a at columns 2i — 1 and 2i 

'in 1 15 1 there is a claim, made very briefly, that LHR is NP-hard 
in general, but it is not substantiated. 

2 In subsequent problem definitions we regard it as implicit that 
P(I) represents the optimal output of a problem P on input /. 

3 The reduction given here can easily be converted into a Karp 
reduction from the decision version of MAX-CUT to the decision 
version of Ungapped-MEC. 
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/ 















1 1 



1 1 



110 1 

1 1 1 

1 1 

V 1 1 



1 






1 1 

1 1 / 



> 32 copies 



M, 



G 



Fig. n.2 

Construction of matrix M (from LemmaQJ for graph in 
Figure ITlTI 



and holes at all other columns. Mi is defined similar 
to M with 1-entries instead of 0-entries. Each row 
of Mq encodes an edge from E: for edge {i,j} 
(with i < j) we specify that columns 2i — 1 and 
2i contain Os, columns 2j — 1 and 2j contain Is, 
and for all h 7^ column 2h — 1 contains and 
column 2h contains 1. (See Figures ITTTTI and HO for 
an example of how M is constructed.) 

Suppose t is the largest cut possible in G and s is 
the minimum number of flips needed to make M 
feasible. We claim that the following holds: 

s=\E\(\V\-2) + 2(\E\-t). (HI) 

From this t, the optimal solution of MAX-CUT, 
can easily be computed. First, note that the solution 
to Ungapped-MEC(M) is trivially upperbounded 
by |V||J5|. This follows because we could simply 
flip every 1 entry in M G to 0; the resulting overall 
matrix would be feasible because we could just take 
Hi as the all-0 string and H 2 as the all-1 string. 
Now, we say a haplotype H has the double-entry 
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property if, for all odd-indexed positions (i.e. 
columns) j in H, the entry at position j of H is 
the same as the entry at position j + 1. We argue 
that a minimal number of feasibility-inducing flips 
will always lead to two haplotypes Hi , H 2 such 
that both haplotypes have the double-entry property 
and, further, Hi is the bitwise complement of 
H 2 . (We describe such a pair of haplotypes as 
partition-encoding.) This is because, if if i, if 2 are 
not partition-encoding, then at least k > | V 1 1 -E7 1 (in 
contrast with zero) entries in M and/or Mi will 
have to be flipped, meaning this strategy is doomed 
to begin with. 

Now, for a given partition-encoding pair of 
haplotypes, it follows that - for each row in M G - 
we will have to flip either |V| — 2 or |V| entries 
to reach its nearest haplotype. This is because, 
irrespective of which haplotype we move a row 
to, the \V\ — 2 pairs of columns not encoding 
end-points (for a given row) will always cost 1 flip 
each to fix. Then either 2 or of the 4 "endpoint- 
encoding" entries will also need to be flipped; 4 
flips will never be necessary because then the row 
could move to the other haplotype, requiring no 
extra flips. Ungapped-MEC thus maximises the 
number of rows which require |V| — 2 rather than 
|V| flips. If we think of Hi and H 2 as encoding a 
partition of the vertices of V (i.e. a vertex i is on 
one side of the partition if Hi has Is in columns 
2i — 1 and 2i, and on the other side if H 2 has Is in 
those columns), it follows that each row requiring 
\V\ — 2 flips corresponds to a cut-edge in the vertex 
partition defined by Hi and H 2 . The expression 
(IlLTT) follows. 



reduction 4 from CUBIC-MIN-UNCUT, which is 
the problem of finding the minimum number of 
edges that have to be removed from a 3-regular 
graph in order to make it bipartite. Our first goal 
is thus to prove the APX-hardness of CUBIC- 
MIN-UNCUT, which itself will be proven using an 
L-reduction from the APX-hard problem CUBIC- 
MAX-CUT. 

To aid the reader, we reproduce here the definition 
of an L-reduction. 

Definition 1: (Papadimitriou and Yannakakis 
ED) Let A and B be two optimisation problems. 
An L-reduction from A to B is a pair of functions 
R and S, both computable in polynomial time, such 
that for any instance I of A with optimum cost 
Opt(I), R(I) is an instance of B with optimum cost 
Opt(R(I)) and for every feasible 5 solution s of R(I), 
S(s) is a feasible solution of I such that: 

Opt(R(I)) < aOpt(I), (H.2) 

for some positive constant a and: 

\Opt(I) - c(S(s))\ < (3\Opt(R(I)) - c(s)\, (11.3) 

for some positive constant (3, where c(S(s)) and 
c(s) represent the costs of S(s) and s, respectively. 

Observation 1: CUBIC-MIN-UNCUT is APX- 
hard. 

Proof: We give an L-reduction from CUBIC- 
MAX-CUT, the problem of finding the maximum 
cardinality of a cut in a 3-regular graph. (This 
problem is shown to be APX-hard in [T); see also 
0.) Let G = (V, E) be the input to CUBIC-MAX- 
CUT. 



B. 1-gap MEC 
Problem: I -gap MEC 

Input: SNP matrix M with at most 1 gap per row 
Output: The smallest number of flips needed to 
make M feasible. 

To prove that 1-gap MEC is APX-hard (and 
therefore also NP-hard) we will give an L- 



Note that CUBIC-MIN-UNCUT is the 
"complement" of CUBIC-MAX-CUT, as expressed 
by the following relationship: 

CUBIC-MAX-CUT(G) 

= \E\ - CUBIC-MIN-UNCUT(G). ( ' 

4 An L-reduction is a specific type of approximation-preserving 
reduction, first introduced in |21 ]. If there exists an L-reduction from 
a problem X to a problem Y, then a PTAS for Y can be used to build 
a PTAS for X. Conversely, if there exists an L-reduction from X to 
Y, and X is APX-hard, so is Y. See (for example) 1 10 1 for a succinct 
discussion of this. 

5 Note that feasible in this context has a different meaning to 
feasible in the context of SNP matrices. 
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To see why this holds, note that for every cut C, 
the removal of the edges E \ C will lead to a 
bipartite graph. On the other hand, given a set of 
edges E' whose removal makes G bipartite, the 
complement is not necessarily a cut. However, 
given a bipartition induced by the removal of E', 
the edges from the original graph that cross this 
bipartition form a cut C, such that \C'\ > \E\E'\. 
This proves (111.41) . and the mapping (just described) 
from E' to C is the mapping we use in the 
L-reduction. 

Now, note that property (111.21) of the L-reduction is 
easily satisfied (taking a = 1) because the optimal 
value of CUBIC-MIN-UNCUT is always less than 
or equal to the optimal value of CUBIC-MAX- 
CUT. This follows from the combination of (111.41) 
with the fact that a maximum cut in a 3-regular 
graph always contains at least 2/3 of the edges: if 
a vertex has less than two incident edges in the cut 
then we can get a larger cut by moving this vertex 
to the other side of the partition. 

To see that property (III.3I) of the L-reduction 
is easily satisfied (taking (3 = 1), let E' be any 
set of edges whose removal makes G bipartite. 
Property (111.3ft is satisfied because E' gets mapped 
to a cut C, as defined above, and combined with 
(1114ft this gives: 

CUBIC-MAX-CUT(G) - \C'\ 

< CUBIC-MAX-CUT(G) -\E\E'\ (11.5) 

= \E'\ - CUBIC-MIN-UNCUT(G). 

This completes the L-reduction from CUBIC-MAX- 
CUT to CUBIC-MIN-UNCUT, proving the APX- 
hardness of CUBIC-MIN-UNCUT 

■ 

We also need the following observation. 

Observation 2: Let G = (V, E) be an undirected, 
3-regular graph. Then we can find, in polynomial 
time, an orientation of the edges of G so that 
each vertex has either in-degree 2 and out-degree 
1 ("in-in-out") or out-degree 2 and in-degree 1 
("out-out-in"). 

Proof: (We assume that G is connected; if G is 
not connected, we can apply the following argument 
to each component of G in turn, and the overall 
result still holds.) Every cubic graph has an even 



number of vertices, because every graph must have 
an even number of odd-degree vertices. We add an 
arbitrary perfect matching to the graph, which may 
create multiple edges. The graph is now 4-regular 
and therefore has an Euler tour. We direct the edges 
following the Euler-tour; every vertex is now in-in- 
out-out. If we remove the perfect matching edges 
we added, we are left with an oriented version of G 
where every vertex is in-in-out or out-out-in. This 
can all be done in polynomial time. 

■ 

Lemma 2: 1-gap MEC is APX-hard 

Proof: We give a reduction from CUBIC- 
MIN-UNCUT. Consider an arbitrary 3-regular 
graph G = (V, E) and orient the edges as described 
in Observation |2l to obtain an oriented version 
of G, G = (V, E), where each vertex is either 
in-in-out or out-out-in. We construct an \E\ x \V\ 
input matrix M for 1-gap MEC as follows. The 
columns of M correspond to the vertices of G and 
every row of M encodes an oriented edge of G; 
it has a in the column corresponding to the tail 
of the edge (i.e. the vertex from which the edge 
leaves), a 1 in the column corresponding to the 
head of the edge, and the rest holes. 

We prove the following: 

CUBIC-MIN-UNCUT(G) = 1-gap MEC(M). 

(H.6) 

We first prove that: 

1-gap MEC(M) < CUBIC-MIN-UNCUT(G). 

(E.7) 

To see this, let E' be a minimal set of edges whose 
removal makes G bipartite, and let \E'\ = k. Let 
B = (L U R, E \ E') be the bipartite graph (with 
bipartition L U R) obtained from G by removing 
the edges E'. Let Hi (respectively, H 2 ) be the 
haplotype that has Is in the columns representing 
vertices of L (respectively, R) and Os elsewhere. 
It is possible to make M feasible with k flips, by 
the following process: for each edge in E', flip 
the bit in the corresponding row of M to 1. For 
each row r of M it is now true that d(r, Hi) =0 or 
d(r, H 2 ) = 0, proving the feasibility of M. 

The proof that, 

CUBIC-MIN-UNCUT(G) < 1-gap MEC(M), 

(H.8) 
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is more subtle. Suppose we can render M feasible 
using j flips, and let Hi and H 2 be any two 
haplotypes such that, after the j flips, each row 
of M is distance from either Hi or H 2 . If Hi 
and H 2 are bitwise complementary then we can 
make G bipartite by removing an edge whenever 
we had to flip a bit in the corresponding row. The 
idea is, namely, that the Is in Hi (respectively, H 2 ) 
represent the vertices L (respectively, R) in the 
resulting bipartition LU R. 

However, suppose the two haplotypes Hi and 
H 2 are not bitwise complementary. In this case it 
is sufficient to demonstrate that there also exists 
bitwise complementary haplotypes H[ and H' 2 such 
that, after j (or fewer) flips, every row of M is 
distance from either H[ or H 2 . Consider thus a 
column of H\ and H 2 where the two haplotypes 
are not complementary. Crucially, the orientation 
of G ensures that every column of M contains 
either one 1 and two Os or two Is and one 
(and the rest holes). A simple case analysis shows 
that, because of this, we can always change the 
value of one of the haplotypes in that column, 
without increasing the number of flips. (The 
number of flips might decrease.) Repeating this 
process for all columns of Hi and H 2 where the 
same value is observed thus creates complementary 
haplotypes H[ and H 2 , and - as described in 
the previous paragraph - these haplotypes then 
determine which edges of G should be removed to 
make G bipartite. This completes the proof of (111.6b . 

The above reduction can be computed in polynomial 
time and is an L-reduction. From (111.61 ) it follows 
directly that property (111.21 ) of an L-reduction is 
satisfied with a — 1. Property (III .3b . with (3=1, 
follows from the proof of (III. 8b . combined with 
(111.61) . Namely, whenever we use (say) t flips to 
make M feasible, we can find s < t edges of G 
that can be removed to make G bipartite. Combined 
with (111.61) this gives: 

\CUBIC-MIN-UNCUT(G) - s\ 

< \l-gap MEC(M) - t\. ( -* 



C. Binary-MEC 

From a mathematical point of view it is 
interesting to determine whether MEC stays NP- 



hard when the input matrix is further restricted. Let 
us therefore define the following problem. 

Problem: Binary-MEC 

Input: An SNP matrix M that does not contain 
any holes 

Output: As for Ungapped-MEC 

Like all optimisation problems, the problem 
Binary-MEC has different variants, depending on 
how the problem is defined. The above definition 
is technically speaking the evaluation variant of the 
Binary-MEC problem 6 . Consider the closely-related 
constructive version: 

Problem: Binary-Constructive-MEC 

Input: An SNP matrix M that does not contain 

any holes 

Output: For an input matrix M of size nx m, two 
haplotypes Hi,H 2 e {0, l} m minimizing: 

D M (Hi, H 2 ) = min(rf(r, H x ), d(r, H 2 )). 

rows r of M 

(11.10) 

In the appendix, we prove that Binary-Constructive- 
MEC is polynomial-time Turing interreducible with 
its evaluation counterpart, Binary-MEC. This 
proves that Binary-Constructive-MEC is solvable 
in polynomial-time iff Binary-MEC is solvable in 
polynomial-time. We mention this correspondence 
because, when expressed as a constructive problem, 
it can be seen that MEC is in fact a specific 
type of clustering problem, a topic of intensive 
study in the literature. More specifically, we are 
trying to find two representative "median" (or 
"consensus") strings such that the sum, over all 
input strings, of the distance between each input 
string and its nearest median, is minimised. This 
interreducibility is potentially useful because we 
now argue, in contrast to claims in the existing 
literature, that the complexity of Binary-MEC / 
Binary-Constructive-MEC is actually still open. 

To elaborate, it is claimed in several papers 
(e.g. 0) that a problem equivalent to Binary- 
Constructive-MEC is NP-hard. Such claims 
inevitably refer to the seminal paper Segmentation 
Problems by Kleinberg, Papadimitriou, and 

6 See (3) for a more detailed explanation of terminology in this 
area. 
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Raghavan (KPR), which has appeared in multiple 
different forms since 1998 (e.g. iTLTI. f\l\ and 
ifm.) However, the KPR papers actually discuss 
two superficially similar, but essentially different, 
problems: one problem is essentially equivalent to 
Binary-Constructive-MEC, and the other is a more 
general (and thus, potentially, a more difficult) 
problem. 7 Communication with the authors 11201 
has confirmed that they have no proof of hardness 
for the former problem i.e. the problem that is 
essentially equivalent to Binary-Constructive-MEC. 

Thus we conclude that the complexity of Binary- 
Constructive-MEC / Binary-MEC is still open. 
From an approximation viewpoint the problem 
has been quite well- studied; the problem has a 
Polynomial Time Approximation Scheme (PTAS) 
because it is a special form of the Hamming 
2 -Median Clustering Problem, for which a PTAS 
is demonstrated in ifTTl. Other approximation 
results appear in O, Q- Ell, GUI and a 
heuristic for a similar (but not identical) problem 
appears in lfT9ll . We also know that, if the number 
of haplotypes to be found is specified as part 
of the input (and not fixed as 2), the problem 
becomes NP-hard; we prove this in the following 
section. Finally, it may also be relevant that the 
"geometric" version of the problem (where rows 
of the input matrix are not drawn from {0, l} m 
but from M. m , and Euclidean distance is used 
instead of Hamming distance) is also open from a 
complexity viewpoint fTSll . (However, the version 
using Euclidean-distance- squared is known to be 
NP-hard Q.) 

D. Parameterised Binary-MEC 

Let us now consider a generalisation of the 
problem Binary-MEC, where the number of 
haplotypes is not fixed as two, but part of the input. 

Problem: Parameterised-Binary-MEC (PBMEC) 
Input: An SNP matrix M that contains no holes, 

7 In this more general problem, rows and haplotypes are viewed 
as vectors and the distance between a row and a haplotype is their 
dot product. Further, unlike Binary-Constructive-MEC, this problem 
allows entries of the input matrix to be drawn arbitrarily from R. This 
extra degree of freedom - particularly the ability to simultaneously 
use positive, negative and zero values in the input matrix - is what 
(when coupled with a dot product distance measure) provides the 
ability to encode NP-hard problems. 



and keN\ {0} 

Output: The smallest number of flips needed to 
make M feasible under k haplotypes. 
The notion of feasible generalises easily to k > 1 
haplotypes: an SNP matrix M is feasible under k 
haplotypes if M can be partitioned into k segments 
such that all the rows within each segment are 
pairwise non-conflicting. The definition of Dm 
also generalises easily to k haplotypes; we define 
Dm,Ic(Hi, H 2 , H k ) as: 

min(d(r, H 1 ), d(r, H 2 ), d(r, H k )). 

rows r of M 

(11.11) 

We define OptTuples(M, k) as the set of unordered 
optimal A;-tuples of haplotypes for M i.e. those k- 
tuples of haplotypes which have a D Mjk score equal 
to PBMEC(M, k). 

Lemma 3: PBMEC is NP-hard 

Proof: We reduce from the NP-hard problem 
MINIMUM- VERTEX-COVER. Let G = (V, E) be 
an undirected graph. A subset V C V is said to 
cover an edge (u, v) G E iff u G V or v & V. A 
vertex cover of an undirected graph G = (V, E) is 
a subset U of the vertices such that every edge in 
E is covered by U. MINIMUM- VERTEX-COVER 
is the problem of, given a graph G, computing the 
size of a minimum cardinality vertex cover U of 
G. 

Let G = (V,E) be the input to MINIMUM- 
VERTEX-COVER. We construct an SNP matrix M 
as follows. M has \V\ columns and 3 1 £7 1 1 + \E\ 
rows. We name the first 3|J5||V| rows M and 
the remaining \E\ rows M G . M is the matrix 
obtained by taking the \V\ x \V\ identity matrix 
(i.e. Is on the diagonal, Os everywhere else) and 
making 3|.E| copies of each row. Each row in M G 
encodes an edge of G: the row has 1 -entries at the 
endpoints of the edge, and the rest of the row is 0. 
We argue shortly that, to compute the size of the 
smallest vertex cover in G, we call PBMEC(M, k) 
for increasing values of k (starting with k = 2) 
until we first encounter a k such that: 

PBMEC(M, k) = 3\E\(\V\ - (k - 1)) + \E\. 

(11.12) 

Once the smallest such k has been found, we 
can output that the size of the smallest vertex 
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Fig. II. 3 

Example input graph to MINIMUM-VERTEX-COVER (see 
Lemma[3J 
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cover in G is k — 1. (Actually, if we haven't 
yet found a value < |V| — 2 satisfying the 
above equation, we can check by brute force in 
polynomial-time whether G has a vertex cover of 
size \V\ —3, |V| —2, |V| — 1, or \V\. The reason for 
wanting to ensure that PBMEC(M, k) is not called 
with k > \V\ — 2 is explained later in the analysis. 8 ) 

It remains only to prove that (for k < \V\ — 2) 
dll.121) holds iff G has a vertex cover of size k — 1. 

To prove this we need to first analyze 
OptTuples(M , k). Recall that M was obtained 
by duplicating the rows of the \V\ x \V\ identity 
matrix. Let I\v\ be shorthand for the \V\ x \V\ 
identity matrix. Given that M is simply a "scaled 
up" version of I\ V \, it follows that: 

OptTuples{M , k) = OptTuples{I\ V \, k). (11.13) 

Now, we argue that all the /c-tuples in 
OptTuples{I\ V \,k) (for k < \V\ - 2) have 
the following form: one haplotype from the tuple 
contains only Os, and the remaining k— 1 haplotypes 
from the tuple each have precisely one entry set to 
1. Let us name such a fc-tuple a candidate tuple. 

First, note that PBMEC(I m ,k) < \V\ - {k - 1), 
because |V| — (k — 1) is the value of the D 
measure - defined in (III. 1 II ) - under any candidate 
tuple. Secondly, under an arbitrary A;-tuple there 
can be at most k rows of Ly\ which contribute 
to the D measure. However, if precisely k rows 

8 Note that, should we wish to build a Karp reduction from the 
decision version of MINIMUM- VERTEX-COVER to the decision 
version of PBMEC, it is not a problem to make this brute force 
checking fit into the framework of a Karp reduction. The Karp 
reduction can do the brute force checking itself and use trivial inputs 
to the decision version of PBMEC to communicate its "yes" or "no" 
answer. 



Fig. H.4 

Construction of matrix M for graph from Figure ITO 



of I\v\ contribute to the D measure (i.e. every 
haplotype has precisely one entry set to 1, and the 
haplotypes are all distinct) then there are \V\ — k 
rows which each contribute 2 to the D measure; 
such a /c-tuple cannot be optimal because it has 
a D measure of 2(|V| — k) > \V\ - (k - 1). So 
we reason that at most k — 1 rows contribute 
to the D measure. In fact, precisely k — 1 rows 
must contribute to the D measure because, 
otherwise, there would be at least \V\ — (k — 2) 
rows contributing at least 1, and this is not possible 
because PBMEC(I\ v \,k) < \V\ - (k — 1). So 
k — 1 of the haplotypes correspond to rows of L v \, 
and the remaining \V\ — (k — 1) rows of Ly\ must 
each contribute 1 to the D measure. But the only 
way to do this (given that \V\ — (k — 1) > 2) is to 
make the A;th haplotype the haplotype where every 
entry is 0. Hence: 

PBMEC(I\ V \,k) = \V\ - (k - 1) (11.14) 

and: 

PBMEC(M , k) = 3\E\(\V\ - (k - 1)). (11.15) 

OptTuples{I\ v \,k) (= OptTuples(M , k)) is, by 
extension, precisely the set of candidate A>tuples. 

The next step is to observe that OptTuples(M, k) C 
OptTuples(M , k). To see this, suppose (by way 
of contradiction) that it is not true, and there exists 
a A;-tuple H* G OptTuples(M, k) that is not in 
OptTuples(M , k). But then replacing H* by any 
A;-tuple out of OptTuples(M , k) would reduce 
the number of flips needed in M by at least 
3 1 .El, in contrast to an increase in the number of 
flips needed in Mq of at most 2\E\, thus leading 
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to an overall reduction in the number of flips; 
contradiction! (The 2\E\ figure is the number of 
flips required to make all rows in M G equal to the 
all-0 haplotype.) 

Because OptTuples(M, k) C OptTuples(M , k), 
we can restrict our attention to the /c-tuples in 
OptTuples(M , k). Observe that there is a natural 
1-1 correspondence between the elements of 
OptTuples(M 0l k) and all size k — 1 subsets of V: 
a vertex v 6 V is in the subset corresponding to 
H* e OptTuples(M , k) iff one of the haplotypes 
in H* has a 1 in the column corresponding to 
vertex v. 

Now, for a fc-tuple H* e OptTuples(M , k) 
we let Cov(G, H*) be the set of edges in G which 
are covered by the subset of V corresponding to 
H*. (Thus, \Cov(G,H*)\ = \E\ iff H* represents 
a vertex cover of G.) It is easy to check that, for 
H* e OptTuples(M ,k): 

D Mtk {H*) = Z\E\{\V\-{k-l)) 
+ \Cov(G, H*)\ 
+ 2(\E\ - \Cov(G, H*\) 
= 3\E\{\V\-(k-l)) 
+ 2\E\ - \Cov(G, H*)\. 

Hence, for H* e OptTuples{M , k), D Myk {H*) 
equals 3|£|(|V| — (k — 1)) + \E\ iff H* represents 
a size k — 1 vertex cover of G. 



III. Longest Haplotype Reconstruction 
(LHR) 

Suppose an SNP matrix M is feasible. Then 
we can partition the rows of M into two sets, M/ 
and M r , such that the rows within each set are 
pairwise non-conflicting. (The partition might not 
be unique.) From Mj (i e {/, r}) we can then build 
a haplotype Hi by combining the rows of M, ; as 
follows: The jth column of Hi is set to 1 if at least 
one row from Mj has a 1 in column j, is set to 
if at least one row from Mj has a in column j, 
and is set to a hole if all rows in Mj have a hole in 
column j. Note that, in contrast to MEC, this leads 
to haplotypes that potentially contain holes. For 
example, suppose one side of the partition contains 

rows 10--, -0-- and 1; then the haplotype 

we get from this is 10-1. We define the length 



of a haplotype H, denoted as \H\, as the number 
of positions where it does not contain a hole; the 
haplotype 10-1 thus has length 3, for example. 
Now, the objective with LHR is to remove rows 
from M to make it feasible but also such that the 
sum of the lengths of the two resulting haplotypes 
is maximised. We define the function LHR(M) 
(which gives a natural number as output) as the 
largest value this sum-of-lengths value can take, 
ranging over all feasibility-inducing row-removals 
and subsequent partitions. 

In Section IIII-AI we provide a polynomial-time 
dynamic programming algorithm for the ungapped 
variant of LHR, Ungapped-LHR. In Section Illl-BI 
we show that LHR becomes APX-hard and NP-hard 
when at most one gap per input row is allowed, 
automatically also proving the hardness of LHR in 
the general case. 

A. A polynomial-time algorithm for Ungapped-LHR 

Problem: Ungapped-LHR 

Input: An ungapped SNP matrix M 

Output: The value LHR(M), as defined above 

The LHR problem for ungapped matrices was 
proved to be polynomial-time solvable by Lancia 
et. al in lfT5ll . but only with the genuine restriction 
that no fragments are included in other fragments. 
Our algorithm improves this in the sense that 
it works for all ungapped input matrices; our 
algorithm is similar in style to the algorithm that 
solves MFR 9 in the ungapped case by Bafna et. 
al. in J4|. Note that our dynamic-programming 
algorithm computes Ungapped-LHR(M) but it can 
easily be adapted to generate the rows that must be 
removed (and subsequently, the partition that must 
be made) to achieve this value. 

Lemma 4: Ungapped-LHR can be solved in time 

0{n 2 m + n 3 ) 

Proof: Let M be the input to Ungapped-LHR, 
and assume the matrix has size n x m. For row 
i define as the leftmost column that is not a 
hole and define r(i) as the rightmost column that 

'Minimum Fragment Removal: in this problem the objective is not 
to maximise the length of the haplotypes, but to minimise the number 
of rows removed 
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is not a hole. The rows of M are ordered such 
that < if i < j. Define the matrix as 
the matrix consisting of the first i rows of M and 
two extra rows at the top: row and row —1, both 
consisting of all holes. Define W(i) as the set of 
rows j < i that are not in conflict with row i. 

For h,k < i and h, k > —1 and r{h) < r(k) 
define D[h,k;i] as the maximum sum of lengths 
of two haplotypes such that: 

• each haplotype is built up as a combination of 
rows from Mj (in the sense explained above); 

• each row from M, can be used to build at most 
one haplotype (i.e. it cannot be used for both 
haplotypes); 

• row k is one of the rows used to build a 
haplotype and among such rows maximises 

• row h is one of the rows used to build the 
haplotype for which k is not used and among 
such rows maximises r(-). 

The optimal solution of the problem, LHR(M), is 
given by: 

max D\h,k;n]. (HI.l) 

h,k\r(h)<r(k) 

This optimal solution can be calculated by starting 
with D[h, k,0] = for h,k e —1, and using the 
following recursive formulas. We distinguish three 
different cases, the first is that h, k < i. Under these 
circumstances: 

D[h,k;i] = D[h,k;i- 1]. (IH.2) 

This is because: 

• if r(i) > r(k): row i cannot be used for the 
haplotype that row k is used for, because row 
k has maximal r(-) among all rows that are 
used for a haplotype; 

• if r(i) < r{k): row i cannot increase the length 
of the haplotype that row k is used for (because 
also > l(k))\ 

• the same arguments hold for h. 

The second case is when h = i; D[i, k;i) is equal 
to: 

max D[j,k;i-l] + f{i,j). (IH.3) 

jew(i), j+k 

r(j)<r(i) 



Where f(i,j) = r(i) — max{r(j),l(i) — 1} is 
the increase of the haplotype's length. Equation 
(IIII.3I ) results from the following. The definition 
of D[i, k; i] says that row i has to be used for the 
haplotype for which k is not used and amongst 
such rows maximises r(-). Therefore, the optimal 
solution is achieved by adding row i to some 
solution that has a row j as the most-right-ending 
row, for some j that agrees with i, is not equal 
to k and ends before i. Adding row i to the 
haplotype leads to an increase of its length of 
f(hj) = r (0 — max{r(j), — 1}. This term 
is fixed, for fixed i and j and therefore we only 
have to consider extensions of solutions that were 
already optimal. Note that this reasoning does not 
hold for more general, "gapped", data. 

The last case is when k = i; D[h,i;i] is 
equal to: 

max f D\j,h;i - 1] +f(i,j) if r(h)>r(j), 
jew(i), &h\ D[h,j;i- 1] + f(ij) if r(h) < r(j). 

r(j)<r(i) 

The above algorithm can be sped up by using 
the fact that, as a direct consequence of (IIII.2I) . 

D[h, k; i] = D[h, k; max(h, k)} for all h, k < i < n. 
It is thus unnecessary to calculate the values 

D[h, k; i) for h,k < i. 

The time for calculating all the W(i) is 0(n 2 m). 
When all the W(i) are known, it takes 0(n 3 ) 
time to calculate all the D[h,k;max(h,k)]. This 
is because we need to calculate 0(n 2 ) values 
D[i,k;i] and also 0(n 2 ) values D[h,i;i] that take 
0(n) time each. This leads to an overall time 
complexity of 0(n 2 m + n 3 ). 

■ 

B. 1 -gap LHR is NP-hard and APX-hard 
Problem: 1-gap LHR 

Input: SNP matrix M with at most one gap per 
row 

Output: The value LHR(M), as defined earlier 

In this section we prove that 1-gap LHR is 
APX-hard (and thus also NP-hard.) We prove this 
by demonstrating (indirectly) an L-reduction from 
the problem CUBIC-MAX-INDEPENDENT-SET - 
the problem of computing the maximum cardinality 
of an independent set in a cubic graph - which is 
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itself proven APX-hard in JT). 

We do this in several steps. We first show an L- 
reduction from Single Haplotype LHR (SH-LHR), 
the version of LHR where only one haplotype is 
used 10 , to LHR, such that the number of gaps per 
rows is unchanged. We then show an L-reduction 
from CUBIC-MAX-INDEPENDENT-SET to 2-gap 
SH-LHR. Then, using an observation pertaining to 
the structure of cubic graphs, we show how this 
reduction can be adapted to give an L-reduction 
from CUBIC-MAX-INDEPENDENT-SET to 1-gap 
SH-LHR. This proves the APX-hardness of 1-gap 
SH-LHR and thus (by transitivity of L-reductions) 
also 1-gap LHR. 

Lemma 5: SH-LHR is L-reducible to LHR, such 
that the number of gaps per row is unchanged. 

Proof: Let M be the n x m input to SH-LHR. 
We may assume that M contains no duplicate 
rows, because duplicate rows are entirely redundant 
when working with only one haplotype. We map 
the SH-LHR input, M, to the 2n x m LHR input, 
M', by taking each row of M and making a copy 
of it. Informally, the idea is that the influence 
of the second haplotype can be neutralised by 
doubling the rows of the input matrix. Note that 
this construction clearly preserves the maximum 
number of gaps per row. 

Now, let SOL(M') be the set that contains all 
pairs of haplotypes (Hi, H 2 ) that can be induced 
by removing some rows of M', partitioning the 
remaining rows of M' into two mutually non- 
conflicting sets, and then reading off the two 
induced haplotypes. Similarly, let SOL(M) be 
the set that contains all haplotypes H that can be 
induced by removing some rows of M (such that 
the remaining rows are mutually non-conflicting) 
and then reading off the single, induced haplotype. 
Note the following pair of observations, which both 
follow directly from the construction of M': 

(H h H 2 ) G SOL(M') Hi,H 2 G SOL(M), 

(in.4) 

10 More formally:- rows of the input matrix M must be removed 
until the remaining rows are mutually non-conflicting. The length of 
the resulting single haplotype, which we seek to maximise, is the 
number of columns (amongst the remaining rows) that have at least 
one non-hole entry. 



H G SOL(M) (H, H) G SOL(M'). (111.5) 

To satisfy the L-reduction we need to show how 
elements from SOL(M') are mapped back to el- 
ements of SOL(M) in polynomial time. So, let 
(Hi,H 2 ) be any pair from SOL(M'). If \H X \ > 
\H 2 \ map the pair (Hi, H 2 ) to Hi, otherwise to H 2 . 
This completes the L-reduction, and we now prove 
its correctness. Central to this is the proof of the 
following: 

SH-LHR(M) = -LHR(M'). (III.6) 

The fact that SH- LHR( M) > \LHR(M') follows 
immediately from (IIII.4I) and the mapping described 
above. (This lets us fulfil condition III.2I) of the 
L-reduction definition, taking a = 2.) The fact that 
SH-L HR(M) < \LHR(M') follows because, by 
dffl.5l ). every element in SOL(M) is guaranteed to 
have a counterpart in SOL(M') which has a total 
length twice as large. 

We can fulfil condition dll.31 ) of the L-reduction 
by taking (3 = \. To see this, let (Hi,H 2 ) be 
any pair from SOL(M'), and (wlog) assume that 
\Hi\ > \H 2 \. Let r = LHR(M'), the distance of 
(Hi,H 2 ) from optimal is then: 

r-(\Hi\ + \H 2 \)>r-2\H 1 \. (HI.7) 

Let I = SH-LHR(M), then: 

I — \Hi\ = \~ l-^i I 

= ^-W) an* 

< \( r - + \H 2 \)\ 

Thus, taking f3 = \ satisfies condition dll.31) of the 
L-reduction. 

■ 

Lemma 6: 2-gap SH-LHR is APX-hard 

Proof: We reduce from CUBIC-MAX- 
INDEPENDENT-SET. Let G = (V,E) be 
the undirected, cubic input to CUBIC-MAX- 
INDEPENDENT-SET. We direct the edges of G 
in^the manner described by Observation^ to give 
G = (V, E). Thus, every vertex of G is now 
out-out-in or in-in-out. A vertex w is a child of 
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a vertex v if there is an ed|e leaving v in the 
direction of w i.e. (v,w) £ E 1 , and in this case i> 
is said to be the parent of w. 

Let t> iri be the number of vertices in G that 
are in-in-out, and v out be the number of vertices 
that are out-out-in. We build a matrix M, to be 
used as input to 2-gap SH-LHR, which has \V\ 
rows and 2v in + v out columns. The construction of 
M is as follows. (Each row of M will represent a 
vertex from V, so we henceforth index the rows 
of M using vertices of V .) Now, to each in-in-out 
vertex of G, we allocate two adjacent columns of 
M, and for each out-out-in vertex, we allocate one 
column of M. (A column may not be allocated to 
more than one vertex.) 11 For simplicity, we also 
impose an arbitrary total order P on the vertices of 
V. 

Now, for each vertex v £ V, we build row v 
as follows. Firstly, we put l(s) in the column(s) 
representing v. Secondly, consider each child w 
of v. If w is an out-out-in vertex, we put a in 
the column representing w. Alternatively, w is 
an in-in-out vertex, so w is represented by two 
columns; in this case we put a in the left such 
column (if v comes before the other parent of w 
in the total order P) or, alternatively, in the right 
column (if v comes after the other parent of w in 
the total order P). The rest of the row is holes. 

This completes the construction of M. Note that 
rows encoding in-in-out vertices contain two 
adjacent Is and one 0, with at most one gap in the 
row, and rows encoding out-out-in vertices contain 
one 1 and two 0s, with at most two gaps in the 
row. In either case there are precisely 3 non-hole 
elements per row. It is also crucial to note that, 
reading down any one column of M, one sees 
exactly one 1 and exactly one 0. 

Let K be any submatrix of M obtained by 
removing rows from M, and let V[K) C V be 
the set of vertices whose rows appear in K. 
If the rows of K are mutually non-conflicting, 
then the haplotype induced by K has length 
3r where r is the number of rows in K. This 



Note that, for this lemma, it is not important how the columns 
are allocated; in the proof of Lemma [5] the ordering is crucial. 




Fig. III.l 
Example input graph to 
CUBIC-MAX-INDEPENDENT-SET (SEE Lemmas|61and|7J 

AFTER AN APPROPRIATE EDGE ORIENTATION HAS BEEN APPLIED. 



10 

10 







1 1 - 



1 - 





-11 







1 1 
- o / 



^3 Vi V 2 V 5 V 5 V 7 V 8 V 8 t>4 V 4 V e V e 

«i / - 1 \ 

v 2 
v 3 

ve 
v 7 

v 8 \ 

Fig. III.2 

Construction of matrix M (from Lemma[6]and0i for 

GRAPH IN FlGURElTTT~T1 



follows from the aforementioned facts that every 
column of M contains exactly one 1 and one 0. 
and that every row has exactly 3 non-hole elements. 

We now prove that the rows of K are in conflict 
iff V[K\ is not an independent set. First, suppose 
V[K] is not an independent set. Then there exist 
u, v £ V[K] such that (u,v) £ E. In row v of K 
there are thus l(s) in the column(s) representing 
vertex v. However, there is also (in row u) a in 
the column (or one of the columns) representing 
vertex v, causing a conflict. Hence, if V[K] is not 
an independent set, K is in conflict. Now consider 
the other direction. Suppose K is in conflict. Then 
in some column of K there is a and a 1. Let u 
be the row where the is seen, and v be the row 
where the 1 is seen. So both u and v are in V[K\. 
Further, we know that there is an out-edge (u,v) 
in E, and thus an edge between u and v in E, 
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proving that V[K] is not an independent set. This 
completes the proof of the iff relationship. 

It follows that: 

CUBIC-MAX-INDEPENDENT-SET (G) 
= \SH-LHR{M). 

The conditions of the L-reduction definition are now 
easily satisfied, because of the 1-1 correspondence 
between haplotypes induced (after row-removals) 
and independent sets in G, and the fact that a size- 
r independent set of G corresponds to a length- 
3r haplotype (or, equivalently, to r mutually non- 
conflicting rows of M.) The L-reduction is formally 
satisfied by taking a = 3 and @ — |. The two 
functions that comprise the L-reduction are both 
polynomial time computable. 

■ 

Lemma 7: 1-gap SH-LHR is APX-hard. 

Proof: This proof is almost identical to the 
proof of Lemma the difference is the manner in 
which columns of M are assigned to vertices of G. 
The informal motivation is follows. In the previous 
allocation of columns to vertices, it was possible 
for a row corresponding to an out-out-in vertex to 
have 2 gaps. Suppose, for each out-out-in vertex, 
we could ensure that one of the Os in its row was 
adjacent to the 1 in the row, with no holes in 
between. Then every row of the matrix would have 
(at most) 1 gap, and we would be finished. We now 
show that, by exploiting a rather subtle property 
of cubic graphs, it is indeed possible to allocate 
columns to vertices such that this is possible. 

Assume, that we have ordered the edges of 
G as before to obtain G. Let V out C V be those 
vertices in V that are out-out-in. Now, suppose we 
could compute (in polynomial time) an injective 
function favourite : V out — > V with the following 
properties: 

• for every v G V out , (v, favourite(v)) E E; 

• the subgraph of G induced by edges of the 
form (v, favour ite(v)), henceforth called the 
favourite-induced subgraph, is acyclic. 

Given such a function it is easy to create a total 
enumeration of the vertices of V such that every 
out-out-in vertex is immediately followed by its 
favourite vertex. This enumeration can then be 



used to allocate the columns of M to the vertices 
of V, such that every row of M has at most one 
gap. To ensure this property, it is necessary to 
stipulate that, where favourite(v) is an in-in-out 
vertex, the encoding the edge (v, favourite(v)) 
is placed in the left of the two columns encoding 
favourite(v). This is not a problem because every 
vertex is the favourite of at most one other vertex. 

It remains to prove that the function favourite 
exists and that it can be constructed in polynomial 
time. This is equivalent to finding vertex disjoint 
directed paths in G such that every out-out-in vertex 
is on such a path and all paths end in an in-in-out 
vertex. Lemma |8] tells us how to find such paths. 
We thank Bert Gerards for invaluable help with this. 

This completes the proof that 1-gap SH-LHR 
is APX-hard. (See Figures 111111 and UTOI for an 
example of the whole reduction in action.) 

■ 

Lemma 8: Let G be a directed, cubic graph 
with a partition (V ou t, Kn) °f me vertices such that 
the vertices in V mt are out-out-in and the vertices 
in V in are in-in-out. Then V out can be covered, in 
polynomial time, by vertex-disjoint directed paths 
ending in V in . 

Proof: Observe that any two directed circuits 
contained entirely within V out are pairwise vertex 
disjoint. Let V' out be obtained from V out by shrinking 
each directed circuit in V out to a single vertex, and 
let G' be the resulting new graph. (Note that each 
vertex in V' out has outdegree at least 2 and indegree 
at most 1 and that the indegree of each node in 
Vi n is still 2, because we do not delete multiple 
edges) We now argue that it is possible to find a 
set of edges F' in G', with \F'\ = \V^ ut \, such that 
- for each v e V' out - precisely one edge from F' 
begins at v, and such that no two edges in F' have 
the same endpoint. We prove this by construction. 
For each vertex u G V' out that has a child v in V^ ut , 
we can add the edge (u, v) to F', because v has 
indegree 1 and therefore no other edges can end 
at v. (In case u has two such children, we can 
choose one of the edges to add to F'). Thus we 
are left to deal with a subset of vertices L C V' out 
where every vertex in L has all its children in V in . 
Now consider the bipartite graph B with bipartition 
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(L, V in ) and an edge for every directed edge of G' 
going from L to V in . If we can find a matching in 
B of size \L\, we can complete the construction of 
F 1 by adding the edges from the perfect matching. 
Hall's Theorem states that a bipartite graph with 
bipartition (X,Y) has a matching of size \X\ iff, 
for all X' C X, \N(X')\ > \X'\, where N(X') is 
the set of all neighbours of X'. Now, note that each 
vertex in L sends at least two edges across the 
partition of B, and each vertex in V in can accept 
at most two such edges, so for each V C L it is 
clear that \N(L')\ > \L'\. Hence, the graph (L, V in ) 
does indeed have a matching of size \L\ and the 
construction of F' can be completed. 

Now, given that the graph induced by V out is 
acyclic, so is F' . Let F be the set of edges in G 
corresponding to those in F'. F is acyclic and each 
directed circuit C in V out has exactly one vertex 
vc that is a tail of an edge of F and no vertex that 
is a head of an edge in F. Let Pc be the longest 
directed path in C that ends in vc- Then the union 
of F and all Pc over all directed circuits C in V out 
is a collection of paths ending in V in and covering 

►out- 

Finding cycles in a graph and finding a maximum 
matching in a bipartite graph are both polynomial- 
time computable, so the whole process described 
above is polynomial-time computable. 

■ 

Lemma 9: 1-gap LHR is APX-hard. 

Proof: Follows from Lemma 13 and Lemma 



IV. Conclusion 

This paper involves the complexity (under various 
different input restrictions) of the haplotyping 
problems Minimum Error Correction (MEC) and 
Longest Haplotype Reconstruction (LHR). The 
state of knowledge about MEC and LHR after 
this paper is demonstrated in Table U We also 
include Minimum Fragment Removal (MFR) 
and Minimum SNP Removal (MSR) in the table 
because they are two other well-known Single 
Individual Haplotyping problems. MSR (MFR) is 
the problem of removing the minimum number of 
columns (rows) from an SNP-matrix in order to 



make it feasible. 



MEC 


Binary (i.e. no holes) 


? (Section III-O 
PTAS known 1 1 1 1 


Ungapped 


NP-hard (Section 111- At 


1-Gap 


NP-hard (Section III-BI. 
APX-hard (Section PTE! 


LHR 


Ungapped 


P (Section IIII- A> 


1-Gap 


NP-hard (Section IIII-BI 
APX-hard (Section lilLBl 


MFR 


Ungapped 


P |4| 


1-Gap 


NP-hard 1151 
APX-hard |4| 


MSR 


Ungapped 


P |15| 


1-Gap 


NP-hard |4| 
APX-hard |4| 



TABLE I 



The new state of knowledge following our work 

Indeed, from a complexity perspective, the most 
intriguing open problem is to ascertain the 
complexity of the "re-opened" problem Binary- 
MEC. It would also be interesting to study the 
approximability of Ungapped-MEC. 

From a more practical perspective, the next 
logical step is to study the complexity of these 
problems under more restricted classes of input, 
ideally under classes of input that have direct 
biological relevance. It would also be of interest 
to study some of these problems in a "weighted" 
context i.e. where the cost of the operation in 
question (row removal, column removal, error 
correction) is some function of (for example) an a 
priori specified confidence in the correctness of the 
data being changed. 
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Appendix: Interreducibility of MEC and 

CONSTRUCTIVE-MEC 

Lemma 10: MEC and Constructive-MEC are 
polynomial-time Turing interreducible. (Also: 
Binary-MEC and Binary-Constructive-MEC are 
polynomial-time Turing interreducible.) 

Proof: We show interreducibility of MEC 
and Constructive-MEC in such a way that the 
interreducibility of Binary-MEC with Binary- 
Constructive-MEC also follows immediately from 
the reduction. This makes the reduction from 
Constructive-MEC to MEC quite complicated 
because we must thus avoid the use of holes. 

1. Reducing MEC to Constructive-MEC is trivial 
because, given an optimal haplotype pair (Hi, H 2 ), 
D M (Hi, H 2 ) can easily be computed in polynomial- 
time by summing min(d(Hi, r), d(H 2 , r)) over all 
rows r of the input matrix M. 

2. Reducing Constructive-MEC to MEC is 
more involved. To prevent a particular special case 
which could complicate our reduction, we first 
check whether every row of M (i.e. the input to 
Constructive-MEC) is identical. If this is so, we 
can complete the reduction by simply returning 
(Hi, Hi) where Hi is the first row of M. Hence, 
from this point onwards, we assume that M has at 
least two distinct rows. 

Let Opt Pair s(M) be the set of all unordered 
optimal haplotype pairs for M i.e. the set of all 
(Hi,H 2 ) such that D M (H 1 ,H 2 ) = MEC(M). 
Given that all rows in M are not identical, 
we observe that there are no pairs of the 
form (Hi, Hi) in OptPairs(M). 12 Let 
OptPairs(M,H') C OptPairs(M) be those 
elements (Hi,H 2 ) E OptPairs(M) such that 
Hi = H' or H 2 = H'. Let g(r, H u H 2 ) be defined 
as mm(d(r, Hi), d(r, H 2 )). 

Consider the following two subroutines: 

Subroutine: DFN ("Distance From Nearest 
Optimal Haplotype Pair") 

Input: An n x m SNP matrix M and a vector 

12 This is because Dm{H\, Hi) is always larger than DM(Hi,r) 
for any row r in M that is not equal to Hi. 



r e {0, l} m . 

Output: The value d^n which we define as 
follows: 

d d fn = , min g(r,H x ,H 2 ). 

(Hi,H z )eOptPairs{M) 

Subroutine: ANCHORED-DFN ('Anchored Dis- 
tance From Nearest Optimal Haplotype Pair") 
Input: An n x m SNP matrix M, a vector r E 
{0, l} m , and a haplotype H' such that (H ', H 2 ) E 
OptPairs(M) for some H 2 . 
Output: The value d a df n , defined as: 

d adfn = min g(r,Hi,H 2 ). 

(H 1 ,H 2 )&OptPairs(M,H') 

We assume the existence of implementations 
of DFN and ANCHORED-DFN which run 
in polynomial-time whenever MEC runs in 
polynomial-time. We use these two subroutines 
to reduce Constructive-MEC to MEC and then, 
to complete the proof, demonstrate and prove 
correcteness of implementations for DFN and 
ANCHORED-DFN. 

The general idea of the reduction from 
Constructive-MEC to MEC is to find some 
pair (Hi, H 2 ) E OptPairs(M) by first finding Hi 
(using repeated calls to DFN) and then finding 
H 2 (by using repeated calls to ANCHORED- 
DFN with Hi specified as the "anchoring" 
haplotype.) Throughout the reduction, the following 
two observations are important. Both follow 
immediately from the definition of D - i.e. dll.101) . 

Observation 3: Let Mi U M 2 be a partition 
of rows of the matrix M into two sets. 
Then, for all Hi and H 2 , D M (Hi,H 2 ) = 
D Ml (Hi,H 2 )+D M2 (Hi,H 2 ). 

Observation 4: Suppose an SNP matrix Mi 
can be obtained from an SNP matrix M 2 by 
removing or more rows from M 2 . Then 

MEC (Mi) < MEC(M 2 ). 

To begin the reduction, note that, for an 
arbitrary haplotype X, DFN(M, X) = iff 
(X, H 2 ) E OptPairs(M) for some haplotype H 2 . 
Our idea is thus that we initialise X to be all-0 and 
flip one entry of X at a time (i.e. change a to a 1 
or vice-versa) until DFN(M, X) = 0; at that point 
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X = H x (for some (H U H 2 ) e OptPairs{M).) 
More specifically, suppose DFN(M, X) = d where 
< d < m. 13 If we define flip(X, i) as the 
haplotype obtained by flipping the entry in the ith 
column of X, then we know that there exists i 
(1 < % < m) such that DFN(M, flip(X, i)) < d. 
Such a position must exist because we can flip 
some entry in X to bring it closer to the haplotype 
(which we know exists) that it was distance d 
from. It is clear that we can find a position i in 
polynomial-time by calling DFN(M, flip(X,j)) 
for 1 < j < m until it is found. Having found such 
an i, we set X = flip(X, i). 

Clearly this process can be iterated, finding 
one entry to flip in every iteration, until 
DFN(M, X) = and at this point setting 
Hi = X gives us the desired result. Given that 
DFN(M, X) decreases by at least 1 every iteration, 
at most m — 1 iterations are required. 

Thus, having found H\, we need to find some H 2 
such that (H U H 2 ) is in OptPairs(M). 

First, we initialise X to be the complement 
of Hi (i.e. the row obtained by flipping every 
entry of Hi). Now, observe that if X ^ Hi 
and ANCHORED-DFN(M, X, Hi) = then 
(Hi,X) E OptPairs(M) and we are finished. 
The tactic is thus to find, at each iteration, 
some position i of X such that ANCHORED- 
DFN(M, flip(X, i),Hi) is less than ANCHORED- 
DFN(M, X, Hi), and then setting X to be 
flip(X,i). As before we repeat this process until 
our call to ANCHORED-DFN returns zero. The 
"trick" in this case is to prevent X converging on 
Hi, because (knowing that M has at least two 
different types of row) (Hi, Hi) ^ OptPairs(M). 
The initialisation of X to the complement of Hi 
guarantees this. To see why this is, observe that, if 
X is the complement of Hi, d(X, Hi) = m. Thus, 
we would need at least m flips to transform X into 
H\. However, if X is the complement of Hi, then 
- because we have guaranteed that OptPairs(M) 
contains no pairs of the form (Hi, Hi) - we 
know that ANCHORED-DFN(M, X, Hi) < m. 



13 It is not possible that DFN(M, X) = m, because all (Hi,H 2 ) G 
OptPairs(M) are of the form Hi 7^ H 2 , and if Hi 7^ H2 we know 
that g(X, Hi, H 2 ) < m. 



Given that we can guarantee that ANCHORED- 
DFN(M, X, Hi) can be reduced by at least 1 at 
every iteration, it is clear that we can find an X 
such that ANCHORED-DFN(M, X, Hi) = after 
making no more than m — 1 iterations, which 
ensures that X cannot have been transformed into 
H\. Once we have such an X we can set H 2 = X 
and return (H\,H 2 ). 

To complete the proof of Lemma [TU] it remains 
only to demonstrate and prove the correctness 
of algorithms for DFN and ANCHORED-DFN, 
which we do below. Note that both DFN and 
ANCHORED-DFN run in polynomial-time if MEC 
runs in polynomial-time. 

Subroutine: DFN ("Distance From Nearest 
Optimal Haplotype Pair") 

Input: An n x m SNP matrix M and a vector 
r e {0,l} m . 

Output: The value ddf n which we define as 
follows: 



d, 



dfn 



mm 

(Hi,H 2 )€OptPairs(M) 



g(r,Hi,H 2 ) 



The following is a three-step algorithm to compute 
DFN(M,r) which uses an oracle for MEC. 

1. Compute d =MEC(M). 

2. Let M' be the n(m + 1) x m matrix obtained 
from M by making m + 1 copies of every row of 
M. 

3. Return MEC(Af U {r}) - (m + l)d where 
M' U {r} is the matrix obtained by adding the 
single row r to the matrix M' . 

To prove the correctness of the above we first 
make a further observation, which (as with the 
two previous observations) follows directly from 
(ITTTTOb . 

Observation 5: Suppose an kn x m SNP matrix 
Mi is obtained from an n x m SNP matrix 
M 2 by making k > 1 copies of every row of 
M 2 . Then MEC(M X ) = k.MEC(M 2 ), and 
OptPairs(Mi) = OptPairs(M 2 ). 

By the above observation we know that 
MEC(M') = (m + l)d and OptPairs(M') = 
OptPairs(M). Now, we argue that 
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OptPairs(M' U {r}) C OptPairs(M). To 
see why this is, suppose there existed (H 3 , H 4 ) 
such that (H 3 ,H A ) E OptPairs{M' U {r}) but 
(H 3 , H4) ^ OptPairs(M). This would mean 
D M (H 3 , H 4 ) > d where d =MEC(M). Now: 

DM'u{r}(H 3 , H 4 ) > D M i(H 3 ,H 4 ) 

= (m+l)D M (H 3 ,H A ) 
> (m + l)(d+l). 

However, if we take any (Hi, H 2 ) E OptPairs(M), 
we see that: 

D M , u{r} (Hi, H 2 ) <(m + l)d + g(r, H x , H 2 ) 
< (m + l)d + m. 

Now, (m + l)d + m < (m + l)(d + 1) so (H 3 , H 4 ) 
could not possibly be in OptPairs(M' U {r}) 
- contradiction! The relationship OptPairs(M' U 
{r}) C OptPairs(M) thus follows. It further 
follows, from Observation |3J that the members 
of OptPairs(M' U {r}) are precisely those pairs 
(Hi, H 2 ) E OptPairs(M) that minimise the 
expression g(r,Hi,H 2 ). The minimal value of 
g(r, Hi, H 2 ) has already been defined as ddf n , so 
we have: 

MEC(M' U {r}) = (m+ l)d + d d/n . 

This proves the correctness of Step 3 of the 
subroutine. 



OptPairs(M U {H'}). To prove the other direc- 
tion, suppose there existed some pair (Hi, H 2 ) E 
OptPairs(M U {H'}) such that H x ^ H' and 
H 2 7^ H'. But then, from Observation |3j we would 
have: 

Dmu{h>}(Hi, H 2 ) = D M (Hi, H 2 ) + g(H', Hi, H 2 ) 
>D M (Hi,H 2 ) + l 
> d. 

Thus, (Hi,H 2 ) could not have been 
in OptPairs(M U {H 1 }) in the first 
place, giving us a contradiction. Thus 
OptPairs(M U {H'}) C OptPairs(M, H') and 
hence OptPairs(M U {#'}) = OptPairs(M, H'), 
proving the correctness of subroutine ANCHORED- 
DFN. ■ 



Subroutine: ANCHORED-DFN ("Anchored 
Distance From Nearest Optimal Haplotype Pair") 
Input: An n x m SNP matrix M, a vector 
r G {0, l} m , and a haplotype iJ' such that 
(H 1 , H 2 ) E OptPairs(M) for some H 2 . 
Output: The value d a df n , defined as: 

d adfn = min g(r,Hi,H 2 ). 

Given that H' is one half of some optimal haplotype 
pair for M, it can be shown that ANCHORED- 
DFN (M, r, H') = DFN(M U {H'}, r), thus demon- 
strating how ANCHORED-DFN can be easily 
reduced to DFN in polynomial-time. To prove 
the equation it is sufficient to demonstrate that 
OptPairs(M U {H 1 }) = OptPairs(M, H'), which 
we do now. Let d =MEC(M). It follows that 
MEC(MU{#'}) > d. In fact, MEC(MU {H 1 }) = d 
because D Mu{H , } (H', H 2 ) = d for all (H',H 2 ) E 
OptPairs(M,H'). Hence OptPairs(M,H') C 



